!pip3 install plotly
!pip install jupyter_contrib_nbextensions
Requirement already satisfied: plotly (4.14.3), jupyter_contrib_nbextensions (0.5.1), and their dependencies.
!pip3 install folium
!jupyter nbextension install <url>/toc2.zip --user
!jupyter nbextension enable toc2/main
Requirement already satisfied: folium (0.12.1) and its dependencies.
The system cannot find the file specified.
Enabling notebook extension toc2/main...
- Validating: ok
## Plotting Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import folium
from folium.plugins import HeatMap
## Pandas Dataframe Library
import pandas as pd
## Numpy Library
import numpy as np
## Train and Test Split
from sklearn.model_selection import train_test_split
## Evaluation Metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
## Normalize
from sklearn.preprocessing import MinMaxScaler
## Models
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
## Kfold and ROC
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, auc
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold, KFold
df=pd.read_csv('feature_data.csv') #reading the file of data
label_df=pd.read_csv('label_data.csv') #reading the file of labels
Exploratory analysis is the stage of data analysis in which you explore the data and extract useful information from it.
It is carried out once all the data has been collected, cleaned, and processed: you check whether the data already contains the information you need for the analysis, or whether further manipulation is required.
During this step, various Python techniques (such as summary functions and plots) help build an understanding of the data, so that it can be interpreted more effectively and better conclusions can be drawn according to the requirements.
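Before the column-by-column exploration below, a quick missing-value summary is often a useful first step. A minimal sketch, using a small toy frame in place of `df` so the snippet is self-contained:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the real data is loaded from feature_data.csv
toy = pd.DataFrame({
    'time_until_order': [309.0, np.nan, 153.0, 33.0],
    'country': ['PRT', 'ESP', None, 'BRA'],
})

# Count and percentage of missing values per column
missing = toy.isnull().sum()
missing_pct = toy.isnull().mean() * 100
print(missing)
print(missing_pct.round(1))
```

On the real `df`, the same two lines reveal the heavily-missing columns (e.g. `company` and `anon_feat_13`) at a glance.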
df['cancelation'] = label_df['cancelation'] # copy the "cancelation" column from the labels file into the data frame
Checking the new column:
df.head()
| Unnamed: 0 | time_until_order | order_year | order_month | order_week | order_day_of_month | adults | children | babies | country | ... | anon_feat_5 | anon_feat_6 | anon_feat_7 | anon_feat_8 | anon_feat_9 | anon_feat_10 | anon_feat_11 | anon_feat_12 | anon_feat_13 | cancelation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 51014 | 309.0 | 2016 | May | week_20 | 13 | 2 | 0.0 | 0 | PRT | ... | 0.0 | 215.0 | 0.0 | 0 | 0.0 | 0.250606 | 17.588299 | True | 1.0 | True |
| 1 | 28536 | 3.0 | 2016 | October | week_41 | 2 | 2 | 0.0 | 0 | ESP | ... | 3.0 | 0.0 | 1.0 | 1 | 1.0 | 0.444719 | 2.343371 | True | NaN | False |
| 2 | 21745 | NaN | 2017 | March | week_12 | 19 | 1 | 0.0 | 0 | DEU | ... | 4.0 | 0.0 | 0.0 | 0 | 1.0 | 0.598733 | 2.498820 | True | NaN | False |
| 3 | 17502 | 153.0 | 2015 | September | week_40 | 29 | 2 | 0.0 | 0 | GBR | ... | 3.0 | 0.0 | 0.0 | 0 | 1.0 | 0.335675 | 12.411559 | True | NaN | False |
| 4 | 83295 | 33.0 | 2016 | January | week_5 | 25 | 2 | 0.0 | 0 | BRA | ... | 0.0 | 15.0 | 0.0 | 0 | 0.0 | 0.492874 | 5.743378 | True | NaN | False |
5 rows × 35 columns
print("Row's number:", df.shape[0]) #the 0 axis is the rows
print("Columns's number:", df.shape[1]) #the 1 axis is the columns
Row's number: 89542 Columns's number: 35
Another way of checking the dimensionality of the data:
df.shape
(89542, 35)
Checking columns names in the data file:
df.columns
Index(['Unnamed: 0', 'time_until_order', 'order_year', 'order_month',
'order_week', 'order_day_of_month', 'adults', 'children', 'babies',
'country', 'order_type', 'acquisition_channel', 'prev_canceled',
'prev_not_canceled', 'changes', 'deposit_type', 'agent', 'company',
'customer_type', 'adr', 'anon_feat_0', 'anon_feat_1', 'anon_feat_2',
'anon_feat_3', 'anon_feat_4', 'anon_feat_5', 'anon_feat_6',
'anon_feat_7', 'anon_feat_8', 'anon_feat_9', 'anon_feat_10',
'anon_feat_11', 'anon_feat_12', 'anon_feat_13', 'cancelation'],
dtype='object')
Exploring the type of columns:
df.dtypes
Unnamed: 0              int64
time_until_order      float64
order_year              int64
order_month            object
order_week             object
order_day_of_month      int64
adults                  int64
children              float64
babies                  int64
country                object
order_type             object
acquisition_channel    object
prev_canceled           int64
prev_not_canceled       int64
changes               float64
deposit_type           object
agent                 float64
company               float64
customer_type          object
adr                   float64
anon_feat_0           float64
anon_feat_1             int64
anon_feat_2             int64
anon_feat_3             int64
anon_feat_4             int64
anon_feat_5           float64
anon_feat_6           float64
anon_feat_7           float64
anon_feat_8             int64
anon_feat_9           float64
anon_feat_10          float64
anon_feat_11          float64
anon_feat_12             bool
anon_feat_13          float64
cancelation              bool
dtype: object
Information about the data:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89542 entries, 0 to 89541
Data columns (total 35 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Unnamed: 0           89542 non-null  int64
 1   time_until_order     76861 non-null  float64
 2   order_year           89542 non-null  int64
 3   order_month          86108 non-null  object
 4   order_week           89542 non-null  object
 5   order_day_of_month   89542 non-null  int64
 6   adults               89542 non-null  int64
 7   children             89538 non-null  float64
 8   babies               89542 non-null  int64
 9   country              85201 non-null  object
 10  order_type           89542 non-null  object
 11  acquisition_channel  89542 non-null  object
 12  prev_canceled        89542 non-null  int64
 13  prev_not_canceled    89542 non-null  int64
 14  changes              86065 non-null  float64
 15  deposit_type         80536 non-null  object
 16  agent                77346 non-null  float64
 17  company              5062 non-null   float64
 18  customer_type        79647 non-null  object
 19  adr                  86559 non-null  float64
 20  anon_feat_0          86161 non-null  float64
 21  anon_feat_1          89542 non-null  int64
 22  anon_feat_2          89542 non-null  int64
 23  anon_feat_3          89542 non-null  int64
 24  anon_feat_4          89542 non-null  int64
 25  anon_feat_5          85510 non-null  float64
 26  anon_feat_6          85309 non-null  float64
 27  anon_feat_7          85294 non-null  float64
 28  anon_feat_8          89542 non-null  int64
 29  anon_feat_9          85811 non-null  float64
 30  anon_feat_10         86810 non-null  float64
 31  anon_feat_11         84585 non-null  float64
 32  anon_feat_12         89542 non-null  bool
 33  anon_feat_13         5776 non-null   float64
 34  cancelation          89542 non-null  bool
dtypes: bool(2), float64(14), int64(12), object(7)
memory usage: 22.7+ MB
df.describe() #describing the data
| Unnamed: 0 | time_until_order | order_year | order_day_of_month | adults | children | babies | prev_canceled | prev_not_canceled | changes | ... | anon_feat_3 | anon_feat_4 | anon_feat_5 | anon_feat_6 | anon_feat_7 | anon_feat_8 | anon_feat_9 | anon_feat_10 | anon_feat_11 | anon_feat_13 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 89542.000000 | 76861.000000 | 89542.000000 | 89542.000000 | 89542.000000 | 89538.000000 | 89542.000000 | 89542.000000 | 89542.000000 | 86065.000000 | ... | 89542.000000 | 89542.000000 | 85510.000000 | 85309.000000 | 85294.000000 | 89542.000000 | 85811.000000 | 86810.000000 | 84585.000000 | 5776.000000 |
| mean | 59716.762871 | 103.673879 | 2016.157658 | 15.828807 | 1.857497 | 0.103732 | 0.007896 | 0.087411 | 0.137701 | 0.223877 | ... | 0.032231 | 0.989971 | 1.330944 | 2.339401 | 0.062607 | 0.571922 | 0.335691 | 0.427146 | 8.845679 | 0.365132 |
| std | 34495.242240 | 106.940156 | 0.707461 | 8.779753 | 0.565296 | 0.397797 | 0.095194 | 0.849799 | 1.496269 | 0.663361 | ... | 0.176613 | 1.698086 | 1.879927 | 17.516854 | 0.243415 | 0.793567 | 0.472234 | 0.128140 | 5.236673 | 0.481509 |
| min | 0.000000 | 0.000000 | 2015.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.161008 | 0.038632 | 0.000000 |
| 25% | 29838.250000 | 18.000000 | 2016.000000 | 8.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.328012 | 4.452191 | 0.000000 |
| 50% | 59743.500000 | 69.000000 | 2016.000000 | 16.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.425622 | 8.422255 | 0.000000 |
| 75% | 89610.500000 | 159.000000 | 2017.000000 | 23.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 3.000000 | 3.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.511077 | 12.712815 | 1.000000 |
| max | 119388.000000 | 737.000000 | 2017.000000 | 31.000000 | 55.000000 | 10.000000 | 10.000000 | 26.000000 | 72.000000 | 21.000000 | ... | 1.000000 | 9.000000 | 11.000000 | 391.000000 | 3.000000 | 5.000000 | 1.000000 | 0.907525 | 27.172399 | 1.000000 |
8 rows × 26 columns
Checking how many bookings were cancelled:
df['cancelation'].value_counts()
False    56346
True     33196
Name: cancelation, dtype: int64
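The same counts can be turned directly into proportions with `value_counts(normalize=True)`. A small sketch on a toy series standing in for `df['cancelation']`:

```python
import pandas as pd

# Toy labels standing in for df['cancelation']
cancelation = pd.Series([False, False, True, False, True])

# normalize=True returns proportions instead of raw counts
rates = cancelation.value_counts(normalize=True)
print(rates.loc[False])  # 0.6
print(rates.loc[True])   # 0.4
```

On the real column this gives the cancellation rate (roughly 37%) without computing the percentages by hand.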
In this pie chart we can see the percentage of bookings that were cancelled and bookings that were not:
True indicates a cancelled booking
False indicates a booking that was not cancelled
cancellation_graph = px.pie(df, values=df['cancelation'].value_counts().values, names=df['cancelation'].value_counts().index,
title='Cancelation' , color_discrete_sequence=px.colors.sequential.Peach)
cancellation_graph.update_traces(textposition='inside', textinfo='percent+label')
cancellation_graph.show()
Checking the number of orders of each type:
df['order_type'].value_counts()
Online TA        42450
Offline TA/TO    18154
Groups           14762
Direct            9487
Corporate         3957
Complementary      560
Aviation           170
Undefined            2
Name: order_type, dtype: int64
In this pie chart we see the percentage of each order type:
We can notice that most orders are placed through an online travel agent ("Online TA")
order_types = px.pie(df, values=df['order_type'].value_counts().values, names=df['order_type'].value_counts().index,
title='The type of Orders' ,color_discrete_sequence=px.colors.sequential.Mint
)
order_types.update_traces(textposition='inside', textinfo='percent+label')
order_types.show()
Checking which country's citizens book the most:
df['country'].value_counts().head(1)
PRT    34804
Name: country, dtype: int64
Top 10 countries:
df['country'].value_counts().head(10)
PRT    34804
GBR     8676
FRA     7448
ESP     6170
DEU     5280
ITA     2654
IRL     2412
BEL     1713
BRA     1605
USA     1523
Name: country, dtype: int64
Bar plot showing the number of bookings from the top 10 countries:
fig = go.Figure(data=[go.Bar(
x=df['country'].value_counts().index[0:10], y=df['country'].value_counts().values[0:10],
text=df['country'].value_counts().values[0:10],
textposition='outside',marker_color='lightseagreen'
)])
fig.show()
Pie chart showing the share of bookings by country:
countries = px.pie(df, values=df['country'].value_counts().values, names=df['country'].value_counts().index,
title='Countries' ,color_discrete_sequence=px.colors.sequential.RdPu)
countries.update_traces(textposition='inside', textinfo='percent+label')
countries.show()
Creating a new feature that gives us the total number of guests (adults + children + babies):
df['guest'] = df['adults'] + df['children'] + df['babies']
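One caveat: `children` has a few missing values (89538 of 89542 non-null, per `df.info()` above), and a plain sum propagates NaN into the new `guest` column. A minimal sketch of the issue on a toy frame, with `fillna(0)` as one possible fix:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; 'children' contains a NaN, as in the real data
toy = pd.DataFrame({'adults': [2, 1], 'children': [np.nan, 1.0], 'babies': [0, 0]})

# A plain sum propagates the NaN into 'guest'
toy['guest'] = toy['adults'] + toy['children'] + toy['babies']
print(toy['guest'].isnull().sum())   # 1

# Treating a missing 'children' count as 0 avoids the NaN
toy['guest'] = toy['adults'] + toy['children'].fillna(0) + toy['babies']
print(toy['guest'].tolist())         # [2.0, 2.0]
```

Whether 0 is the right imputation for a missing `children` count is an assumption; the point is only that the NaNs survive the addition.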
Showing the top 5 countries by guests that didn't cancel their booking:
#cancelation == False in order to fetch only the orders that were not cancelled
guest_per_country = df[df['cancelation'] == False]['country'].value_counts().reset_index()
guest_per_country.columns = ['country', 'guest'] #renaming the two columns
guest_per_country.head()
| country | guest | |
|---|---|---|
| 0 | PRT | 15060 |
| 1 | GBR | 6903 |
| 2 | FRA | 6084 |
| 3 | ESP | 4630 |
| 4 | DEU | 4369 |
A world map showing the number of guests per country:
Yellow indicates the largest number of guests
map_of_countries = folium.Map()
guest_map = px.choropleth(guest_per_country, locations = guest_per_country['country'],color = guest_per_country['guest'],
hover_name = guest_per_country['country'])
guest_map.show()
Bar plot showing the number of orders that were not cancelled, by month.
We can see that August and July have the most orders; since this is the summer vacation, most families prefer to go on vacation then
month_order = df[df['cancelation'] == False]['order_month'].value_counts().reset_index()
# Fetching only the orders that were not cancelled
# Axis X marks the month, axis Y the number of orders
graph_order_month = go.Figure(data=[go.Bar(x=month_order['index'], y=month_order['order_month'],
text=month_order['order_month'],textposition='outside',marker_color='green')])
graph_order_month.show()
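Note that `value_counts()` returns the months sorted by count, not by calendar order; for a time-based reading of the bar plot, reindexing by month name restores chronological order. A sketch on toy counts standing in for the real series:

```python
import pandas as pd

month_names = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

# Toy counts standing in for df[df['cancelation'] == False]['order_month'].value_counts()
counts = pd.Series({'August': 30, 'January': 10, 'July': 25})

# reindex puts the months back in calendar order; months absent from the data become NaN
ordered = counts.reindex(month_names).dropna()
print(ordered.index.tolist())   # ['January', 'July', 'August']
```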
month_order = df[df['cancelation'] == False]['order_month'].value_counts()
order_and_month = px.pie(values=month_order.values, names=month_order.index,
                         title='Orders and months', color_discrete_sequence=px.colors.sequential.BuGn)
order_and_month.update_traces(textposition='inside', textinfo='percent+label')
order_and_month.show()
We can see in this bar plot that the highest ADR is in August, and we can also notice that most cancellations happen in that same month; this suggests the high ADR may be a reason for those cancellations.
plt.figure(figsize=(15,10)) #size of graph
sns.barplot(x='order_month', y='adr', hue='cancelation', palette= 'summer', data=df)
plt.title('Order Month vs ADR vs Booking Cancellation Status')
Text(0.5, 1.0, 'Order Month vs ADR vs Booking Cancellation Status')
Crosstab showing the number of cancellations per customer type:
We can see that Transient customers cancel the most
pd.crosstab([df["cancelation"]], df["customer_type"],margins = True).style.background_gradient(cmap = "gist_gray")
| customer_type | Contract | Group | Transient | Transient-Party | All |
|---|---|---|---|---|---|
| cancelation | |||||
| False | 1834 | 355 | 35516 | 12386 | 50091 |
| True | 837 | 37 | 24463 | 4219 | 29556 |
| All | 2671 | 392 | 59979 | 16605 | 79647 |
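Absolute counts favour the largest group, so it can also help to look at the cancellation *rate* per customer type; `pd.crosstab` supports this directly via `normalize='columns'`. A sketch on toy data standing in for the two real columns:

```python
import pandas as pd

# Toy columns standing in for df['cancelation'] and df['customer_type']
cancel = pd.Series([True, False, False, True])
ctype = pd.Series(['Transient', 'Transient', 'Group', 'Contract'])

# normalize='columns' converts counts into per-type cancellation rates
rates = pd.crosstab(cancel, ctype, normalize='columns')
print(rates)
```

On the real data this would show, for instance, that Transient bookings cancel at a higher rate (about 24463/59979) than Transient-Party ones (about 4219/16605).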
Catplot showing the cancelled/non-cancelled counts by customer type:
sns.catplot(x='customer_type', col = 'cancelation', data=df, kind = 'count', palette='Set2') #countplot
<seaborn.axisgrid.FacetGrid at 0x16fc49a2820>
Crosstab showing the number of cancellations per deposit type:
We can see that most of the cancellations are of bookings with no deposit
pd.crosstab([df["cancelation"]], df["deposit_type"],margins = True).style.background_gradient(cmap = "Oranges")
| deposit_type | No Deposit | Non Refund | Refundable | All |
|---|---|---|---|---|
| cancelation | ||||
| False | 50421 | 59 | 94 | 50574 |
| True | 20126 | 9811 | 25 | 29962 |
| All | 70547 | 9870 | 119 | 80536 |
Catplot showing the cancelled/non-cancelled counts by deposit type:
sns.catplot(x="deposit_type", col = 'cancelation', data=df, kind = 'count', palette='rainbow')
<seaborn.axisgrid.FacetGrid at 0x16fc4cb5fa0>
average_of_cancellation = df.groupby(['order_year'])['cancelation'].mean()
print("Cancellation percentage per year:", average_of_cancellation*100)
Cancellation percentage per year:
order_year
2015    37.249423
2016    35.860333
2017    38.663789
Name: cancelation, dtype: float64
Barplot showing the average cancellation percentage by year:
cancell_year = go.Figure(data=[go.Bar(x=average_of_cancellation.index, y=average_of_cancellation*100,
                                      text=average_of_cancellation*100,
                                      textposition='outside', marker_color='grey')])
cancell_year.show()
(sns.FacetGrid(df, hue = 'cancelation',height = 6,xlim = (0,500)).map(sns.kdeplot, 'time_until_order', shade = True)
.add_legend());
#we can notice that the peak of cancellations occurs at a lead time of around 50 days; people who book on short notice tend not to cancel
df_num = df.select_dtypes(include=np.number)
df_num.hist(figsize=(15,15))
plt.show()
#we can notice that anon_feat_10 looks close to normally distributed, while the rest of the features do not
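The `MinMaxScaler` imported earlier is one way to bring such features onto a comparable [0, 1] range before modelling. A minimal sketch on a toy matrix standing in for `df_num`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy numeric matrix standing in for df_num: two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

scaler = MinMaxScaler()            # rescales each column to [0, 1]
X_scaled = scaler.fit_transform(X)
print(X_scaled.min(axis=0))        # [0. 0.]
print(X_scaled.max(axis=0))        # [1. 1.]
```

Note this rescales the range only; it does not make a skewed feature normally distributed.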
A preliminary processing of the data, in order to prepare it for the main processing and for further analysis.
#The correlations are explained together with the second correlation section
## Correlation before null-value handling
features = df.columns ## Fetching all feature column names
## Applying Pearson correlation
mask = np.zeros_like(df[features].corr(), dtype=bool)
mask[np.triu_indices_from(mask)] = True
## Creating a Plot Diagram
f, ax = plt.subplots(figsize=(16, 12))
## Title of Plot
plt.title('Pearson Correlation Matrix before null handling',fontsize=27)
sns.heatmap(df[features].corr(),linewidths=0.25,vmax=0.7,square=True,cmap="RdGy",
linecolor='w',annot=True,annot_kws={"size":8},mask=mask,cbar_kws={"shrink": .9});
df.corr().style.background_gradient(cmap='coolwarm')
#A darker cell colour indicates a stronger correlation
| Unnamed: 0 | time_until_order | order_year | order_day_of_month | adults | children | babies | prev_canceled | prev_not_canceled | changes | agent | company | adr | anon_feat_0 | anon_feat_1 | anon_feat_2 | anon_feat_3 | anon_feat_4 | anon_feat_5 | anon_feat_6 | anon_feat_7 | anon_feat_8 | anon_feat_9 | anon_feat_10 | anon_feat_11 | anon_feat_12 | anon_feat_13 | cancelation | guest | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 1.000000 | 0.008234 | 0.299310 | 0.010057 | -0.005796 | -0.023063 | -0.028345 | -0.018364 | -0.000844 | -0.004441 | -0.615504 | -0.208787 | 0.020074 | -0.149813 | -0.205640 | 0.041075 | -0.019378 | -0.157052 | -0.195246 | -0.013731 | -0.133586 | 0.106112 | -0.818237 | -0.011636 | 0.004953 | -0.001602 | -0.255275 | -0.242871 | -0.021291 |
| time_until_order | 0.008234 | 1.000000 | 0.039087 | 0.003031 | 0.123699 | -0.039179 | -0.021315 | 0.084876 | -0.074639 | 0.000295 | -0.071875 | 0.156485 | 0.012048 | 0.087673 | 0.165793 | -0.002684 | -0.125870 | -0.104356 | -0.170992 | 0.172199 | -0.115189 | -0.094990 | -0.078964 | -0.487098 | 0.959980 | -0.005159 | 0.291412 | 0.294502 | 0.072419 |
| order_year | 0.299310 | 0.039087 | 1.000000 | 0.001116 | 0.034312 | 0.051760 | -0.012514 | -0.119836 | 0.029331 | 0.030669 | 0.062462 | 0.255305 | 0.024363 | 0.019992 | 0.029429 | 0.065655 | 0.009743 | 0.092226 | 0.035632 | -0.056072 | -0.014540 | 0.108936 | -0.034409 | -0.039040 | 0.050678 | -0.002689 | 0.014805 | 0.014949 | 0.054598 |
| order_day_of_month | 0.010057 | 0.003031 | 0.001116 | 1.000000 | -0.001731 | 0.015827 | 0.001337 | -0.024917 | 0.001862 | 0.010334 | -0.000040 | 0.050604 | 0.043146 | -0.015037 | -0.028578 | -0.005779 | -0.006582 | 0.015121 | 0.009192 | 0.021225 | 0.006901 | 0.003584 | 0.000472 | -0.010269 | 0.007953 | -0.001133 | 0.008743 | -0.006737 | 0.007684 |
| adults | -0.005796 | 0.123699 | 0.034312 | -0.001731 | 1.000000 | 0.032568 | 0.020909 | -0.007315 | -0.111358 | -0.049944 | -0.035456 | 0.199110 | 0.125008 | 0.095898 | 0.091632 | 0.024537 | -0.151878 | 0.216493 | 0.147987 | -0.009634 | 0.019174 | 0.130470 | 0.010017 | -0.124635 | 0.151967 | 0.003792 | 0.053173 | 0.058103 | 0.815968 |
| children | -0.023063 | -0.039179 | 0.051760 | 0.015827 | 0.032568 | 1.000000 | 0.025264 | -0.024775 | -0.021466 | 0.050719 | 0.042756 | 0.030002 | 0.154710 | 0.047129 | 0.044691 | -0.050536 | -0.033760 | 0.377861 | 0.329290 | -0.033714 | 0.056907 | 0.080463 | 0.046501 | -0.002811 | -0.028192 | -0.001040 | 0.030702 | 0.005691 | 0.588676 |
| babies | -0.028345 | -0.021315 | -0.012514 | 0.001337 | 0.020909 | 0.025264 | 1.000000 | -0.007703 | -0.007084 | 0.081313 | 0.037577 | 0.009572 | 0.016577 | 0.017176 | 0.019489 | 0.003212 | -0.009158 | 0.041805 | 0.043314 | -0.010851 | 0.034375 | 0.099443 | 0.043540 | 0.009081 | -0.020595 | 0.007887 | -0.030580 | -0.032331 | 0.164627 |
| prev_canceled | -0.018364 | 0.084876 | -0.119836 | -0.024917 | -0.007315 | -0.024775 | -0.007703 | 1.000000 | 0.147278 | -0.027251 | -0.012652 | -0.187248 | -0.031825 | -0.015218 | -0.015059 | -0.003083 | 0.081758 | -0.049984 | -0.059345 | 0.006438 | -0.018536 | -0.048199 | 0.013700 | -0.033175 | 0.076986 | 0.007404 | 0.108098 | 0.109633 | -0.020703 |
| prev_not_canceled | -0.000844 | -0.074639 | 0.029331 | 0.001862 | -0.111358 | -0.021466 | -0.007084 | 0.147278 | 1.000000 | 0.009769 | 0.020546 | -0.212025 | -0.040119 | -0.043565 | -0.049105 | -0.040490 | 0.419006 | -0.021786 | 0.003749 | -0.009848 | 0.042869 | 0.038668 | 0.005575 | 0.078774 | -0.096237 | 0.005830 | -0.067665 | -0.060068 | -0.101481 |
| changes | -0.004441 | 0.000295 | 0.030669 | 0.010334 | -0.049944 | 0.050719 | 0.081313 | -0.027251 | 0.009769 | 1.000000 | 0.066958 | 0.129196 | 0.042877 | 0.055910 | 0.098425 | 0.025598 | 0.009176 | 0.045317 | 0.096657 | -0.012623 | 0.063982 | 0.054734 | 0.070476 | -0.005321 | 0.006169 | -0.002519 | -0.141534 | -0.144559 | -0.000553 |
| agent | -0.615504 | -0.071875 | 0.062462 | -0.000040 | -0.035456 | 0.042756 | 0.037577 | -0.012652 | 0.020546 | 0.066958 | 1.000000 | 0.348193 | 0.018900 | 0.144238 | 0.182138 | -0.050271 | 0.029123 | 0.211083 | 0.238588 | -0.055327 | 0.178723 | 0.031228 | 0.790402 | 0.049566 | -0.073886 | -0.002113 | -0.094053 | -0.081911 | 0.006045 |
| company | -0.208787 | 0.156485 | 0.255305 | 0.050604 | 0.199110 | 0.030002 | 0.009572 | -0.187248 | -0.212025 | 0.129196 | 0.348193 | 1.000000 | 0.072142 | 0.065422 | 0.186355 | 0.119467 | -0.241961 | 0.034927 | 0.092052 | -0.008472 | -0.008917 | -0.107560 | 0.355493 | -0.149590 | 0.194627 | 0.002753 | 0.023725 | -0.012482 | 0.189886 |
| adr | 0.020074 | 0.012048 | 0.024363 | 0.043146 | 0.125008 | 0.154710 | 0.016577 | -0.031825 | -0.040119 | 0.042877 | 0.018900 | 0.072142 | 1.000000 | 0.216139 | 0.243888 | 0.031311 | -0.069512 | 0.185827 | 0.126162 | -0.033824 | 0.018265 | 0.136532 | 0.019577 | -0.062123 | 0.042914 | -0.005026 | -0.085282 | -0.070436 | 0.188005 |
| anon_feat_0 | -0.149813 | 0.087673 | 0.019992 | -0.015037 | 0.095898 | 0.047129 | 0.017176 | -0.015218 | -0.043565 | 0.055910 | 0.144238 | 0.065422 | 0.216139 | 1.000000 | 0.501159 | 0.045912 | -0.088181 | 0.142874 | 0.086890 | -0.054308 | -0.015567 | 0.075290 | 0.188555 | -0.131908 | 0.136377 | 0.001395 | 0.014841 | -0.000038 | 0.104353 |
| anon_feat_1 | -0.205640 | 0.165793 | 0.029429 | -0.028578 | 0.091632 | 0.044691 | 0.019489 | -0.015059 | -0.049105 | 0.098425 | 0.182138 | 0.186355 | 0.243888 | 0.501159 | 1.000000 | 0.037282 | -0.098017 | 0.168091 | 0.101210 | -0.001089 | -0.024434 | 0.067968 | 0.233295 | -0.182914 | 0.221843 | 0.003176 | 0.037926 | 0.026997 | 0.100461 |
| anon_feat_2 | 0.041075 | -0.002684 | 0.065655 | -0.005779 | 0.024537 | -0.050536 | 0.003212 | -0.003083 | -0.040490 | 0.025598 | -0.050271 | 0.119467 | 0.031311 | 0.045912 | 0.037282 | 1.000000 | -0.057257 | -0.120749 | -0.120068 | -0.008728 | -0.038884 | 0.024830 | -0.010596 | -0.021945 | 0.012804 | -0.001637 | -0.021949 | -0.016972 | -0.008321 |
| anon_feat_3 | -0.019378 | -0.125870 | 0.009743 | -0.006582 | -0.151878 | -0.033760 | -0.009158 | 0.081758 | 0.419006 | 0.009176 | 0.029123 | -0.241961 | -0.069512 | -0.088181 | -0.098017 | -0.057257 | 1.000000 | -0.031990 | 0.030655 | -0.022720 | 0.069486 | 0.009119 | 0.051930 | 0.162063 | -0.171925 | 0.003590 | -0.082977 | -0.085612 | -0.140845 |
| anon_feat_4 | -0.157052 | -0.104356 | 0.092226 | 0.015121 | 0.216493 | 0.377861 | 0.041805 | -0.049984 | -0.021786 | 0.045317 | 0.211083 | 0.034927 | 0.185827 | 0.142874 | 0.168091 | -0.120749 | -0.031990 | 1.000000 | 0.814067 | -0.069679 | 0.137030 | 0.138020 | 0.250781 | 0.016870 | -0.082272 | 0.000712 | -0.063609 | -0.060772 | 0.389078 |
| anon_feat_5 | -0.195246 | -0.170992 | 0.035632 | 0.009192 | 0.147987 | 0.329290 | 0.043314 | -0.059345 | 0.003749 | 0.096657 | 0.238588 | 0.092052 | 0.126162 | 0.086890 | 0.101210 | -0.120068 | 0.030655 | 0.814067 | 1.000000 | -0.070303 | 0.164445 | 0.126637 | 0.307804 | 0.085888 | -0.164706 | 0.000512 | -0.164383 | -0.176803 | 0.307157 |
| anon_feat_6 | -0.013731 | 0.172199 | -0.056072 | 0.021225 | -0.009634 | -0.033714 | -0.010851 | 0.006438 | -0.009848 | -0.012623 | -0.055327 | -0.008472 | -0.033824 | -0.054308 | -0.001089 | -0.008728 | -0.022720 | -0.069679 | -0.070303 | 1.000000 | -0.033073 | -0.083733 | -0.073289 | -0.077078 | 0.159428 | -0.000483 | 0.085877 | 0.058219 | -0.027910 |
| anon_feat_7 | -0.133586 | -0.115189 | -0.014540 | 0.006901 | 0.019174 | 0.056907 | 0.034375 | -0.018536 | 0.042869 | 0.063982 | 0.178723 | -0.008917 | 0.018265 | -0.015567 | -0.024434 | -0.038884 | 0.069486 | 0.137030 | 0.164445 | -0.033073 | 1.000000 | 0.081367 | 0.221654 | 0.100472 | -0.135518 | 0.005027 | -0.191681 | -0.197533 | 0.052284 |
| anon_feat_8 | 0.106112 | -0.094990 | 0.108936 | 0.003584 | 0.130470 | 0.080463 | 0.099443 | -0.048199 | 0.038668 | 0.054734 | 0.031228 | -0.107560 | 0.136532 | 0.075290 | 0.067968 | 0.024830 | 0.009119 | 0.138020 | 0.126637 | -0.083733 | 0.081367 | 1.000000 | 0.041421 | 0.010021 | -0.077557 | 0.004910 | -0.218565 | -0.233985 | 0.162010 |
| anon_feat_9 | -0.818237 | -0.078964 | -0.034409 | 0.000472 | 0.010017 | 0.046501 | 0.043540 | 0.013700 | 0.005575 | 0.070476 | 0.790402 | 0.355493 | 0.019577 | 0.188555 | 0.233295 | -0.010596 | 0.051930 | 0.250781 | 0.307804 | -0.073289 | 0.221654 | 0.041421 | 1.000000 | 0.071100 | -0.083305 | 0.002088 | -0.125310 | -0.136527 | 0.039907 |
| anon_feat_10 | -0.011636 | -0.487098 | -0.039040 | -0.010269 | -0.124635 | -0.002811 | 0.009081 | -0.033175 | 0.078774 | -0.005321 | 0.049566 | -0.149590 | -0.062123 | -0.131908 | -0.182914 | -0.021945 | 0.162063 | 0.016870 | 0.085888 | -0.077078 | 0.100472 | 0.010021 | 0.071100 | 1.000000 | -0.581297 | -0.000731 | -0.203804 | -0.204077 | -0.099658 |
| anon_feat_11 | 0.004953 | 0.959980 | 0.050678 | 0.007953 | 0.151967 | -0.028192 | -0.020595 | 0.076986 | -0.096237 | 0.006169 | -0.073886 | 0.194627 | 0.042914 | 0.136377 | 0.221843 | 0.012804 | -0.171925 | -0.082272 | -0.164706 | 0.159428 | -0.135518 | -0.077557 | -0.083305 | -0.581297 | 1.000000 | -0.003582 | 0.308467 | 0.313919 | 0.102087 |
| anon_feat_12 | -0.001602 | -0.005159 | -0.002689 | -0.001133 | 0.003792 | -0.001040 | 0.007887 | 0.007404 | 0.005830 | -0.002519 | -0.002113 | 0.002753 | -0.005026 | 0.001395 | 0.003176 | -0.001637 | 0.003590 | 0.000712 | 0.000512 | -0.000483 | 0.005027 | 0.004910 | 0.002088 | -0.000731 | -0.003582 | 1.000000 | -0.029618 | -0.004417 | 0.003469 |
| anon_feat_13 | -0.255275 | 0.291412 | 0.014805 | 0.008743 | 0.053173 | 0.030702 | -0.030580 | 0.108098 | -0.067665 | -0.141534 | -0.094053 | 0.023725 | -0.085282 | 0.014841 | 0.037926 | -0.021949 | -0.082977 | -0.063609 | -0.164383 | 0.085877 | -0.191681 | -0.218565 | -0.125310 | -0.203804 | 0.308467 | -0.029618 | 1.000000 | 1.000000 | 0.055176 |
| cancelation | -0.242871 | 0.294502 | 0.014949 | -0.006737 | 0.058103 | 0.005691 | -0.032331 | 0.109633 | -0.060068 | -0.144559 | -0.081911 | -0.012482 | -0.070436 | -0.000038 | 0.026997 | -0.016972 | -0.085612 | -0.060772 | -0.176803 | 0.058219 | -0.197533 | -0.233985 | -0.136527 | -0.204077 | 0.313919 | -0.004417 | 1.000000 | 1.000000 | 0.045016 |
| guest | -0.021291 | 0.072419 | 0.054598 | 0.007684 | 0.815968 | 0.588676 | 0.164627 | -0.020703 | -0.101481 | -0.000553 | 0.006045 | 0.189886 | 0.188005 | 0.104353 | 0.100461 | -0.008321 | -0.140845 | 0.389078 | 0.307157 | -0.027910 | 0.052284 | 0.162010 | 0.039907 | -0.099658 | 0.102087 | 0.003469 | 0.055176 | 0.045016 | 1.000000 |
Changing the type of the "cancelation" column from bool to int by mapping True to 1 and False to 0:
df['cancelation'].replace({True:1,False:0},inplace=True)
df.isnull().sum() #checking how many null values for every feature
Unnamed: 0 0 time_until_order 12681 order_year 0 order_month 3434 order_week 0 order_day_of_month 0 adults 0 children 4 babies 0 country 4341 order_type 0 acquisition_channel 0 prev_canceled 0 prev_not_canceled 0 changes 3477 deposit_type 9006 agent 12196 company 84480 customer_type 9895 adr 2983 anon_feat_0 3381 anon_feat_1 0 anon_feat_2 0 anon_feat_3 0 anon_feat_4 0 anon_feat_5 4032 anon_feat_6 4233 anon_feat_7 4248 anon_feat_8 0 anon_feat_9 3731 anon_feat_10 2732 anon_feat_11 4957 anon_feat_12 0 anon_feat_13 83766 cancelation 0 guest 4 dtype: int64
Handling null values using interpolation:
Interpolation handles both object and numeric values easily.
(Interpolation is explained in the report.)
Two interpolation methods will be used:
## Linear interpolation will be applied to fill missing values linearly
df = df.interpolate()
## Padding interpolation is applied to handle the values missed by linear interpolation.
# With padding interpolation we specify a limit: the maximum number of consecutive NaNs the method can fill.
df = df.interpolate(method='pad', limit=15) # 'pad': fill NaNs using existing values
df = df.replace([np.inf, -np.inf], np.nan) ## convert infinite values into NaN values
df = df.dropna(how="any") # drop all remaining rows with NaN values
df.isnull().sum() # check that no null values remain in the data
Unnamed: 0 0 time_until_order 0 order_year 0 order_month 0 order_week 0 order_day_of_month 0 adults 0 children 0 babies 0 country 0 order_type 0 acquisition_channel 0 prev_canceled 0 prev_not_canceled 0 changes 0 deposit_type 0 agent 0 company 0 customer_type 0 adr 0 anon_feat_0 0 anon_feat_1 0 anon_feat_2 0 anon_feat_3 0 anon_feat_4 0 anon_feat_5 0 anon_feat_6 0 anon_feat_7 0 anon_feat_8 0 anon_feat_9 0 anon_feat_10 0 anon_feat_11 0 anon_feat_12 0 anon_feat_13 0 cancelation 0 guest 0 dtype: int64
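The two-step fill above can be sketched on a hypothetical toy Series (not part of the dataset): linear interpolation fills interior gaps, padding forward-fills what it missed, and a leading NaN, which neither method can fill, is removed by `dropna`:

```python
import numpy as np
import pandas as pd

# hypothetical toy column standing in for one feature
s = pd.Series([np.nan, 1.0, np.nan, 4.0])

linear = s.interpolate()                             # linear fill: interior gap -> 2.5
padded = linear.interpolate(method='pad', limit=15)  # forward-fill anything still missing
clean = padded.dropna()                              # leading NaN has no earlier value, so drop it

print(linear.tolist())  # [nan, 1.0, 2.5, 4.0]
print(clean.tolist())   # [1.0, 2.5, 4.0]
```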
We will use pandas category codes to convert the categorical columns easily:
(also explained in the report)
The categorical type performs a factorization: each unique value or category is assigned an incrementing integer starting from zero.
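As a quick illustration on a made-up Series (the values here are hypothetical), the unique categories are sorted and numbered from zero:

```python
import pandas as pd

# hypothetical categorical column
s = pd.Series(['Direct', 'Online', 'Direct', 'Corporate'])
codes = s.astype('category').cat.codes

print(codes.tolist())  # [1, 2, 1, 0] -- sorted categories: Corporate=0, Direct=1, Online=2
```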
df['country'] =df['country'].astype('category').cat.codes
df['order_type'] =df['order_type'].astype('category').cat.codes
df['acquisition_channel'] =df['acquisition_channel'].astype('category').cat.codes
df['deposit_type'] =df['deposit_type'].astype('category').cat.codes
df['customer_type'] =df['customer_type'].astype('category').cat.codes
df['order_week'] =df['order_week'].astype('category').cat.codes
Combining the date columns into one date column:
The to_datetime function expects specific column names like year, month, and day, so the predefined column names will be changed:
df = df.rename(columns={'order_year': 'year', 'order_month': 'month', 'order_day_of_month': 'day'})
# Renaming the columns so "to_datetime" can combine them into a date
A few rows have invalid date entries; for example, June has 30 days but there are entries for June 31, so these rows will be dropped:
# months that have 30 days
result = df[df['month'].isin(['June', 'September', 'November', 'April']) & (df['day'] == 31)].index
df = df.drop(result)
# February has at most 28 days here
result1 = df[df['month'].isin(['February']) & df['day'].isin([29, 30, 31])].index
df = df.drop(result1)
Now every month contains only a valid number of days.
Creating Date Object:
df['DATE'] = pd.to_datetime(df.year.astype(str) + '/' + df.month.astype(str) + '/' + df.day.astype(str))
The model can't train on raw dates, so each date is moved into the index:
df=df.set_index(df.DATE)
df = df.drop(['Unnamed: 0','year','month','day','DATE'], axis = 1)
Using the Z-score to find the outliers:
df = df.astype(float) #changing values into float
from scipy import stats
outliers = np.abs(stats.zscore(df))
print(outliers)
[[1.00357364 1.50656764 1.51693954 ... 0.91114642 0.76759718 1.36305332] [0.29469714 0.23573215 0.25186062 ... 0.91114642 0.76759718 0.04335304] [0.98419646 1.28111203 0.25186062 ... 0.91114642 0.76759718 0.04335304] ... [0.62571871 0.23573215 1.51693954 ... 0.91114642 0.76759718 1.36305332] [0.06217103 0.16604016 0.25186062 ... 0.91114642 1.30276665 0.04335304] [0.76943795 0.73995573 0.25186062 ... 0.91114642 0.76759718 0.04335304]]
threshold = 3 # values whose |z-score| exceeds 3 are treated as outliers
print(np.where(outliers > 3)) #printing the indices where outliers>3
(array([ 0, 0, 6, ..., 89406, 89412, 89412], dtype=int64), array([13, 19, 14, ..., 18, 3, 23], dtype=int64))
Removing Outliers:
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
Checking The New Dataset without Outliers:
df.shape
(65827, 32)
Applying Pearson correlation to find the highly correlated features
The correlation coefficient takes values between -1 and 1: values close to ±1 indicate a strong linear relationship, while values close to 0 indicate a weak one.
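A minimal sanity check of that range with `np.corrcoef` on toy arrays:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
print(np.corrcoef(a, 2 * a)[0, 1])  # perfectly correlated -> 1.0
print(np.corrcoef(a, -a)[0, 1])     # perfectly anti-correlated -> -1.0
```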
features = df.columns ## Fetching all Features Column names
## Applying Pearson correlation
mask = np.zeros_like(df[features].corr(), dtype=bool)
mask[np.triu_indices_from(mask)] = True
## Creating a Plot Diagram
f, ax = plt.subplots(figsize=(16, 12))
## Title of Plot
plt.title('Pearson Correlation Matrix',fontsize=27)
sns.heatmap(df[features].corr(),linewidths=0.25,vmax=0.7,square=True,cmap="OrRd",
linecolor='w',annot=True,annot_kws={"size":8},mask=mask,cbar_kws={"shrink": .9});
Highly correlated variables tend to carry similar information, which can bring down model performance, so highly correlated features will be removed.
corr_matrix = df.corr().abs()
## Selecting the upper triangle of the correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
The following features were removed from the dataset because of their high correlation:
to_drop = [column for column in upper.columns if any(upper[column] > 0.80)]
df = df.drop(df[to_drop], axis=1)
to_drop
['anon_feat_11', 'guest']
#splitting
x=df.drop(['cancelation'], axis = 1)
y=df.cancelation
In machine learning, the performance of a model only benefits from more features up to a certain point. The more features are fed into a model, the higher the dimensionality of the data becomes, and as dimensionality increases, overfitting becomes more likely.
The dimensionality problem usually occurs when there is a high number of features, as they can directly affect the model's predictions.
Here the dataset consists of only 35 features and is not high-dimensional; in addition, feature importance ensures that only the important features are used to train the model, so dimensionality reduction is not required for this dataset.
Since there are a number of anonymized features and we don't know what they actually represent, we will apply a feature importance technique to check the importance of each feature and its effect on model training.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(x, y)
RandomForestClassifier(random_state=0)
feature_scores = pd.Series(clf.feature_importances_, index=x.columns).sort_values(ascending=False)
feature_scores
country 0.114353 time_until_order 0.101610 deposit_type 0.093515 anon_feat_10 0.075339 adr 0.073756 company 0.060873 order_week 0.059893 agent 0.058540 anon_feat_8 0.054488 order_type 0.050623 anon_feat_13 0.037529 prev_canceled 0.032511 anon_feat_1 0.030943 anon_feat_5 0.023831 anon_feat_0 0.020756 customer_type 0.020187 changes 0.017125 anon_feat_4 0.013101 adults 0.012354 anon_feat_2 0.012251 anon_feat_12 0.011000 anon_feat_9 0.009463 acquisition_channel 0.009131 children 0.003690 anon_feat_6 0.001666 anon_feat_7 0.000873 prev_not_canceled 0.000598 anon_feat_3 0.000000 babies 0.000000 dtype: float64
Here we can see that babies has the least effect on the model's predictions, while country has the most.
There are also many anonymized features that affect the predictions, so those will be kept for model training.
f, ax = plt.subplots(figsize=(15, 7))
sns.barplot(x=feature_scores, y=feature_scores.index)
ax.set_title("The importance of Features")
ax.set_yticklabels(feature_scores.index)
ax.set_xlabel("Feature importance score")
ax.set_ylabel("Features")
plt.show()
The top 20 features are selected for the model:
## Selecting top 20 features based on the ranking
features = feature_scores.index[0:20]
x = x[features]
x
| country | time_until_order | deposit_type | anon_feat_10 | adr | company | order_week | agent | anon_feat_8 | order_type | anon_feat_13 | prev_canceled | anon_feat_1 | anon_feat_5 | anon_feat_0 | customer_type | changes | anon_feat_4 | adults | anon_feat_2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DATE | ||||||||||||||||||||
| 2015-08-29 | 124.0 | 134.0 | 0.0 | 0.409388 | 140.0 | 506.34375 | 28.0 | 240.0 | 2.0 | 6.0 | 0.0 | 0.0 | 3.0 | 4.0 | 2.0 | 2.0 | 0.0 | 4.0 | 2.0 | 0.0 |
| 2016-11-27 | 149.0 | 2.0 | 0.0 | 0.516904 | 8989.0 | 500.68750 | 43.0 | 9.0 | 1.0 | 6.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| 2016-04-13 | 124.0 | 19.0 | 0.0 | 0.421692 | 154.0 | 495.03125 | 7.0 | 10.0 | 0.0 | 6.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 3.0 | 3.0 |
| 2016-03-05 | 75.0 | 145.0 | 0.0 | 0.326480 | 803.0 | 489.37500 | 1.0 | 9.0 | 0.0 | 6.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| 2016-09-21 | 75.0 | 271.0 | 0.0 | 0.357281 | 10133.0 | 483.71875 | 32.0 | 12.0 | 1.0 | 5.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2017-05-21 | 27.0 | 21.0 | 0.0 | 0.321296 | 1575.0 | 342.00000 | 13.0 | 8.0 | 1.0 | 6.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| 2016-04-14 | 40.0 | 90.0 | 0.0 | 0.398224 | 9095.0 | 342.00000 | 7.0 | 9.0 | 1.0 | 6.0 | 0.0 | 0.0 | 3.0 | 0.0 | 2.5 | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| 2015-08-23 | 124.0 | 39.0 | 0.0 | 0.387626 | 1296.0 | 342.00000 | 28.0 | 240.0 | 1.0 | 6.0 | 0.0 | 0.0 | 5.0 | 0.0 | 3.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2015-08-17 | 124.0 | 110.0 | 0.0 | 0.323147 | 134.0 | 342.00000 | 27.0 | 240.0 | 1.0 | 6.0 | 0.0 | 0.0 | 5.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| 2017-05-28 | 124.0 | 183.0 | 0.0 | 0.286674 | 1395.0 | 342.00000 | 14.0 | 9.0 | 1.0 | 6.0 | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 | 3.0 | 0.0 | 3.0 | 2.0 | 0.0 |
65827 rows × 20 columns
Feature scaling (or standardization) is a data pre-processing step applied to the independent variables, or features, of the data. It normalizes the data within a particular range and can also speed up the calculations in an algorithm. Normalization is important so that all features are scaled equally, which gives better results.
Looking at the data, we can see that it is not normalized.
For scaling we are using MinMaxScaler:
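MinMax scaling maps each feature to [0, 1] via (x - min) / (max - min); a toy check on a hypothetical single-feature column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical single-feature column
data = np.array([[1.0], [2.0], [5.0]])
scaled = MinMaxScaler().fit_transform(data)

print(scaled.ravel())  # [0.   0.25 1.  ]
```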
# Get column names first
names = x.columns
# Create the Scaler object
sc = MinMaxScaler()
# Fit your data on the scaler object
x = sc.fit_transform(x)
x = pd.DataFrame(x, columns=names)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.35, random_state = 47)
print ("X_train: ", len(X_train))
print("X_test: ", len(X_test))
print("y_train: ", len(y_train))
print("y_test: ", len(y_test))
X_train: 42787 X_test: 23040 y_train: 42787 y_test: 23040
test_df = pd.read_csv ('feature_data_test.csv') #loading the data
Creating a function to preprocess the test data:
def test_preprocess(df,features):
df = df.interpolate()
df = df.interpolate(method='pad', limit=15)
df = df.replace([np.inf, -np.inf], np.nan) ## convert infinite values into NaN values
df = df.dropna(how="any")
df['country'] =df['country'].astype('category').cat.codes
df['order_type'] =df['order_type'].astype('category').cat.codes
df['acquisition_channel'] =df['acquisition_channel'].astype('category').cat.codes
df['deposit_type'] =df['deposit_type'].astype('category').cat.codes
df['customer_type'] =df['customer_type'].astype('category').cat.codes
df['order_week'] =df['order_week'].astype('category').cat.codes
df = df[features]
return df
test_df = test_preprocess(test_df,features)
The method follows three steps:
These are the functions that we call for both the simple models and the advanced models:
def models(alg, X_train, X_test, y_train, y_test,test_df):
model = alg
model_alg = model.fit(X_train, y_train)
global y_probablity, y_pred, test_prob #global variables in order to not be deleted at the end of function
y_probablity = model_alg.predict_proba(X_test)[:,1] #predicting probability of label
y_pred = model_alg.predict(X_test) #predicting label
test_prob = model_alg.predict_proba(test_df)[:,1] #testing the real test
train_pred = model_alg.predict(X_train)
name = type(model).__name__
nn_cm = confusion_matrix(y_test, y_pred) # Creating the confusion matrix
# Visualization:
f, ax = plt.subplots(figsize=(5,5))
sns.heatmap(nn_cm, annot=True, linewidth=0.7, linecolor='olive', fmt='.0f', ax=ax, cmap='YlGnBu')
plt.title(name)
plt.xlabel('y_pred')
plt.ylabel('y_test')
plt.show()
def Check(model): #making AUC for every model, the final AUC is the mean of k folds
tprs = []
aucs = []
mean_fpr = np.linspace(0,1,100)
i = 1
fig1 = plt.figure(figsize=[12,12])
cv = KFold(n_splits=5, random_state=7, shuffle=True)
i=1
for train_index, test_index in cv.split(x): #checking for every k fold
#iloc : helps us select a value that belongs to a particular row or column
X_train = x.iloc[train_index]
X_test = x.iloc[test_index]
y_train = y.iloc[train_index]
y_test = y.iloc[test_index]
model.fit(X_train, y_train) # Run Models
prediction = model.predict_proba(X_test)
fpr, tpr, t = roc_curve(y_test, prediction[:, 1]) #making ROC
tprs.append(np.interp(mean_fpr, fpr, tpr))
roc_auc = auc(fpr, tpr)
aucs.append(roc_auc)
print("Training Data Accuracy:", model.score(X_train,y_train)*100)
print("Test Data Accuracy:", model.score(X_test,y_test)*100)
plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
i= i+1
plt.plot([0,1],[0,1],linestyle = '--',lw = 2,color = 'black')
mean_tpr = np.mean(tprs, axis=0)
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, color='blue',
label=r'Mean ROC (AUC = %0.2f )' % (mean_auc),lw=2, alpha=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.text(0.32,0.7,'More accurate area',fontsize = 12)
plt.text(0.63,0.4,'Less accurate area',fontsize = 12)
plt.show()
Parameters:
knn = KNeighborsClassifier(n_neighbors=3)
Check(knn)
Training Data Accuracy: 87.61322420766791 Test Data Accuracy: 77.0621297280875 Training Data Accuracy: 87.61322420766791 Test Data Accuracy: 77.0545344068054 Training Data Accuracy: 87.76347271277201 Test Data Accuracy: 76.82491454614508 Training Data Accuracy: 87.74448368842809 Test Data Accuracy: 77.36422331940751 Training Data Accuracy: 87.6875166153963 Test Data Accuracy: 76.78693505507026
In This Model:
models(KNeighborsClassifier(n_neighbors=3), X_train, X_test, y_train, y_test,test_df)
test_pred = pd.DataFrame() #making a DataFrame for the test predictions
test_pred['KNN Prediction'] = test_prob # adding the predictions for this specific algorithm
lr = LogisticRegression()
Check(lr)
Training Data Accuracy: 76.5196255293291 Test Data Accuracy: 76.99377183654869 Training Data Accuracy: 76.60697670002469 Test Data Accuracy: 76.80388880449644 Training Data Accuracy: 76.60172420341043 Test Data Accuracy: 76.57424990505127 Training Data Accuracy: 76.68147810565493 Test Data Accuracy: 76.2932016710976 Training Data Accuracy: 76.70426493486765 Test Data Accuracy: 76.2324344853779
In This Model:
models(LogisticRegression(), X_train, X_test, y_train, y_test,test_df)
test_pred['LR Prediction'] = test_prob # adding the predictions for this specific algorithm
Parameters:
bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False
rf = RandomForestClassifier()
Check(rf)
Training Data Accuracy: 100.0 Test Data Accuracy: 84.39161476530457 Training Data Accuracy: 99.99810106150662 Test Data Accuracy: 84.62706972504937 Training Data Accuracy: 100.0 Test Data Accuracy: 84.73984048613748 Training Data Accuracy: 99.99810109756561 Test Data Accuracy: 84.8841625522218 Training Data Accuracy: 100.0 Test Data Accuracy: 84.08659323965058
In This Model:
models(RandomForestClassifier(), X_train, X_test, y_train, y_test,test_df)
test_pred['RF Prediction'] = test_prob # adding the predictions for this specific algorithm
Parameters:
activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08, hidden_layer_sizes=(100,), learning_rate='constant', learning_rate_init=0.001, max_fun=15000, max_iter=200, momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5, random_state=None, shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1, verbose=False, warm_start=False
mlp = MLPClassifier()
Check(mlp)
C:\Users\user\Anaconda3\Anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:582: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
Training Data Accuracy: 82.65889367843376 Test Data Accuracy: 80.93574358195352
C:\Users\user\Anaconda3\Anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:582: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
Training Data Accuracy: 81.2992537171721 Test Data Accuracy: 80.35849916451467
C:\Users\user\Anaconda3\Anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:582: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
Training Data Accuracy: 82.50541187193802 Test Data Accuracy: 81.12419293581466
C:\Users\user\Anaconda3\Anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:582: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
Training Data Accuracy: 82.19399187269758 Test Data Accuracy: 81.46600835548804
C:\Users\user\Anaconda3\Anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:582: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
Training Data Accuracy: 82.68390870077096 Test Data Accuracy: 81.32168628940373
In This Model:
models(MLPClassifier(), X_train, X_test, y_train, y_test,test_df)
C:\Users\user\Anaconda3\Anaconda3\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:582: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
test_pred['MLP Prediction'] = test_prob # adding the predictions for this specific algorithm
The confusion matrix and the K-fold cross-validation are included in the Models/Advanced Models part.
If we look at the difference between training and test accuracy in every K-fold, we can see that although there is a little bias toward the training data, overall the models (MLP, SVC, Logistic Regression, etc.) are not overfitting.
The confusion matrix is used to evaluate how much the model has predicted correctly. It combines the true labels and the predicted labels and summarizes the result in four ways: true positives, true negatives, false positives, and false negatives.
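On a toy example, `confusion_matrix` returns those four counts as a 2×2 array (rows are true labels, columns are predicted labels):

```python
from sklearn.metrics import confusion_matrix

# toy labels, not from the dataset
y_true_toy = [0, 0, 1, 1]
y_pred_toy = [0, 1, 1, 1]
cm = confusion_matrix(y_true_toy, y_pred_toy)
# layout: [[TN, FP],
#          [FN, TP]]
print(cm.tolist())  # [[1, 1], [0, 2]]
```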
Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.
Cross-validation procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation.
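A minimal sketch of how `KFold` partitions a sample (toy data, k = 3, same style of split used by our `Check` function):

```python
import numpy as np
from sklearn.model_selection import KFold

x_toy = np.arange(6)
cv = KFold(n_splits=3, shuffle=True, random_state=7)

# each of the k=3 folds holds out a third of the samples for testing
sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in cv.split(x_toy)]
print(sizes)  # [(4, 2), (4, 2), (4, 2)]
```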
In addition to this method we use ROC curve/AUC in order to evaluate the model quality.
AUC stands for "Area Under the ROC Curve": it measures the entire two-dimensional area underneath the whole ROC curve.
AUC provides an aggregate measure of performance across all possible classification thresholds. So, we are looking for higher AUC that indicates a better model.
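`roc_auc_score` computes that aggregate directly from labels and predicted probabilities; on a toy example where one positive is out-ranked by a negative, the AUC is 0.75:

```python
from sklearn.metrics import roc_auc_score

# toy labels and scores, not from the dataset
y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.4, 0.35, 0.8]  # one positive (0.35) ranks below one negative (0.4)
print(roc_auc_score(y_true_toy, y_score_toy))  # 0.75
```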
Under-fitting and over-fitting are two common problems in machine learning; a model usually underperforms because of one of them.
Under-fitting happens when the model is too simple, i.e. it has too few features to learn from or is regularized so heavily that it cannot learn anything from the dataset, which leads to low variance but high bias and wrong predictions. Over-fitting, on the other hand, occurs when a model learns the training data so closely that it fails to give good predictions on unseen data (the test set).
Neither issue has a single fixed solution, but both can be mitigated in a number of ways that are implemented in our model, e.g. k-fold cross-validation and feature selection.
test_pred
| KNN Prediction | LR Prediction | RF Prediction | MLP Prediction | |
|---|---|---|---|---|
| 0 | 0.333333 | 1.000000e+00 | 0.56 | 1.0 |
| 1 | 0.333333 | 0.000000e+00 | 0.40 | 1.0 |
| 2 | 0.666667 | 1.000000e+00 | 0.52 | 1.0 |
| 3 | 0.333333 | 1.000000e+00 | 0.56 | 1.0 |
| 4 | 0.333333 | 1.000000e+00 | 0.43 | 1.0 |
| ... | ... | ... | ... | ... |
| 29825 | 0.333333 | 0.000000e+00 | 0.43 | 1.0 |
| 29826 | 0.333333 | 1.000000e+00 | 0.71 | 1.0 |
| 29827 | 0.666667 | 5.164296e-224 | 0.48 | 1.0 |
| 29828 | 0.666667 | 0.000000e+00 | 0.46 | 1.0 |
| 29829 | 0.666667 | 1.000000e+00 | 0.50 | 1.0 |
29830 rows × 4 columns
We choose "Random Forest" model, because it gave us the highest AUC(=92) which indicates to better prediction.
Choosing Random Forest predictions to submit on our output file:
test_pred['RF Prediction']
0 0.56
1 0.40
2 0.52
3 0.56
4 0.43
...
29825 0.43
29826 0.71
29827 0.48
29828 0.46
29829 0.50
Name: RF Prediction, Length: 29830, dtype: float64
test_pred['RF Prediction'].to_csv("submission_group_12.csv")
These are models we tried while building the project; they were excluded because of their low AUC.
Their code is kept in Markdown cells and is not part of the workflow.
Parameters:
priors=None
,var_smoothing=1e-09
Code:
nb = GaussianNB()
tprs = []
aucs = []
mean_fpr=np.linspace(0,1,100)
i=1
fig1 = plt.figure(figsize=[12,12])
cv = KFold(n_splits=5, random_state=7, shuffle=True)
for train_index, test_index in cv.split(x):
X_train = x.iloc[train_index]
X_test = x.iloc[test_index]
y_train = y.iloc[train_index]
y_test = y.iloc[test_index]
nb.fit(X_train, y_train) # Run Models
prediction = nb.predict_proba(X_test)
fpr, tpr, t = roc_curve(y_test, prediction[:, 1])
tprs.append(np.interp(mean_fpr, fpr, tpr))
roc_auc = auc(fpr, tpr)
aucs.append(roc_auc)
print("Training Data Accuracy:", nb.score(X_train,y_train)100)
print("Test Data Accuracy:", nb.score(X_test,y_test)100)
plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
i= i+1
plt.plot([0,1],[0,1],linestyle = '--',lw = 2,color = 'black')
mean_tpr = np.mean(tprs, axis=0)
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, color='blue',
label=r'Mean ROC (AUC = %0.2f )' % (mean_auc),lw=2, alpha=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.text(0.32,0.7,'More accurate area',fontsize = 12)
plt.text(0.63,0.4,'Less accurate area',fontsize = 12)
plt.show()
In This Model:
Code:
models(GaussianNB(), X_train, X_test, y_train, y_test,test_df)
Parameters:
ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=None, splitter='best'
Code:
dt = DecisionTreeClassifier()
scores = []
y_preds =[]
tprs = []
aucs = []
mean_fpr=np.linspace(0,1,100)
i=1
fig1 = plt.figure(figsize=[12,12])
cv = KFold(n_splits=3, random_state=47, shuffle=True)
for train_index, test_index in cv.split(x):
X_train = x.iloc[train_index]
X_test = x.iloc[test_index]
y_train = y.iloc[train_index]
y_test = y.iloc[test_index]
dt.fit(X_train, y_train) # Run Models
prediction = dt.predict_proba(X_test)
fpr, tpr, t = roc_curve(y_test, prediction[:, 1])
tprs.append(np.interp(mean_fpr, fpr, tpr))
roc_auc = auc(fpr, tpr)
aucs.append(roc_auc)
scores.append(dt.score(X_test, y_test))
print("Training Data Accuracy:", dt.score(X_train,y_train)100)
print("Test Data Accuracy:", dt.score(X_test,y_test)100)
y_preds.append(dt.predict(X_test))
plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
i= i+1
plt.plot([0,1],[0,1],linestyle = '--',lw = 2,color = 'black')
mean_tpr = np.mean(tprs, axis=0)
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, color='blue',
label=r'Mean ROC (AUC = %0.2f )' % (mean_auc),lw=2, alpha=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.text(0.32,0.7,'More accurate area',fontsize = 12)
plt.text(0.63,0.4,'Less accurate area',fontsize = 12)
plt.show()
In This Model:
Code:
models(DecisionTreeClassifier(), X_train, X_test, y_train, y_test,test_df)
Parameters:
C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False
Code:
svc = SVC(probability=True)
scores = []
y_preds =[]
tprs = []
aucs = []
mean_fpr=np.linspace(0,1,100)
i=1
fig1 = plt.figure(figsize=[12,12])
cv = KFold(n_splits=3, random_state=47, shuffle=True)
for train_index, test_index in cv.split(x):
X_train = x.iloc[train_index]
X_test = x.iloc[test_index]
y_train = y.iloc[train_index]
y_test = y.iloc[test_index]
svc.fit(X_train, y_train) # Run Models
prediction = svc.predict_proba(X_test)
fpr, tpr, t = roc_curve(y_test, prediction[:, 1])
tprs.append(np.interp(mean_fpr, fpr, tpr))
roc_auc = auc(fpr, tpr)
aucs.append(roc_auc)
scores.append(svc.score(X_test, y_test))
print("Training Data Accuracy:", svc.score(X_train,y_train)100)
print("Test Data Accuracy:", svc.score(X_test,y_test)100)
y_preds.append(svc.predict(X_test))
plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
i= i+1
plt.plot([0,1],[0,1],linestyle = '--',lw = 2,color = 'black')
mean_tpr = np.mean(tprs, axis=0)
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, color='blue',
label=r'Mean ROC (AUC = %0.2f )' % (mean_auc),lw=2, alpha=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.text(0.32,0.7,'More accurate area',fontsize = 12)
plt.text(0.63,0.4,'Less accurate area',fontsize = 12)
plt.show()
In This Model:
Code:
models(SVC(probability=True), X_train, X_test, y_train, y_test,test_df)
All the visualizations and steps are also explained in the report.